-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CNDB-12553: ensure that memtable is reclaimed even when notification subscribers throw #1545
CNDB-12553: ensure that memtable is reclaimed even when notification subscribers throw #1545
Conversation
Checklist before you submit for review
|
337bcb0
to
67e28f6
Compare
test/unit/org/apache/cassandra/db/memtable/FlushFailingOnNotificationSubscriberTest.java
Outdated
Show resolved
Hide resolved
test/unit/org/apache/cassandra/db/memtable/FlushFailingOnNotificationSubscriberTest.java
Outdated
Show resolved
Hide resolved
cfs.replaceFlushed(memtable, Collections.emptyList(), Optional.empty()); | ||
reclaim(memtable); | ||
return Collections.emptyList(); | ||
try |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, the patch seems reasonable to me. However, as you said, it is not precisely defined what should happen in case of failure in replaceFlushed
. To me, since we have no answer, perhaps the best way to deal with that would be to shutdown the node, and at the same time, make sure that the notification consumer implementation does not throw any exception. Otherwise - from the CNDB point of view - is a failure in notification consumer critical? Can we continue if it happens?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM besides nits from Jacek
subscribers throw The direct cause of CNDB-12553 is that CNDB-specific subscriber to SSTableAddingNotification throws an error, and Cassandra doesn't handle it properly. In such case the flush is interrupted (despite multiple uses of exception-safe code and accumulating exceptions) after the sstable creation transaction committed, but before the memtable has been reclaimed. As a result the memtable allocator believes more and more memory is being used and being reclaimed eventually stopping writes due to apparent lack of memory in the memtable. This patch changes memtable flushing behaviour so that the memtable is reclaimed iff it has been removed from the View, regardless of whether the flush fails or not.
67e28f6
to
d262cb3
Compare
|
✔️ Build ds-cassandra-pr-gate/PR-1545 approved by ButlerApproved by Butler |
FYI this is one of the failure scenarios of CNDB's SSTable management that could lead to data loss. |
I disagree. Each storage notification consumer is run independently. Thus, an exception in one notification consumer does not impact our ability to update ETCD in another one. (IMO this is not an argument to keep ETCD update outside of transaction handling; I just wanted to clarify that the consumers are independent) |
What is the issue
Cassandra doesn't properly support throwing notification subscribers
that fail flushes. In such case the flush is interrupted (despite
multiple uses of exception-safe code and accumulating exceptions)
after the sstable creation transaction committed, but before the
memtable has been reclaimed. As a result the memtable allocator believes
more and more memory is being used and being reclaimed eventually
stopping writes due to apparent lack of memory in the memtable.
What does this PR fix and why was it fixed
This patch changes memtable flushing behaviour so that the memtable
is reclaimed iff it has been removed from the View, regardless
of whether the flush fails or not.